Inverse Ising inference using all the data 
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We show that a method based on logistic regression, using all the data, solves the inverse Ising 
problem far better than mean-field calculations relying only on sample pairwise correlation functions, 
while still computationally feasible for hundreds of nodes. The largest improvement in reconstruction 
occurs for strong interactions. Using two examples, a diluted Sherrington-Kirkpatrick model and a 
two-dimensional lattice, we also show that interaction topologies can be recovered from few samples 
with good accuracy and that the use of Zi-regularization is beneficial in this process, pushing inference 
abilities further into low-temperature regimes. 



Introduction: When analyzing systems of interact- 
ing elements, distinguishing direct correlations (caused 
by actual interactions between elements) from indirect 
correlations (induced through chains of interactions via 
other elements) is an intrinsically complex task. Versions 
of this problem come about naturally in biology, sociol- 
ogy, neuroscience and many other fields, and are bound 
to become more and more important as the amount and 
diversity of data on large systems will continue to grow. 
In the Ising model, which has served as a basic starting 
point for studying such situations in applications [lJ-Q , 
a set of binary variables cr = {o~\, ..., ctat}, Cj = ±1, have 
the distribution 
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where Z is the partition function, /3 = 1/T the inverse 
temperature, hi are the external fields and Jij the pair- 
wise couplings. Given magnetizations to, = (ctj) and 



pairwise correlations cy = (tJitJj 



iTTij the probability 



distribution which maximizes the entropy has the Ising 
model form. The standard inverse Ising problem means 
to compute (approximately, efficiently, or according to 
other criteria) the parameters hi and Jij from observed 
mi and c^. The practical interest in inverse Ising, in 
the context of the present and future data-rich world, 
is to use it as an information extraction tool alterna- 
tive and/or superior to measuring correlations. Extend- 
ing the number of states from two to twenty, spectacular 
success has been achieved in inferring directly interacting 
residues (amino acids) in two-component signaling path- 
ways in bacteria 0, Q , and use has also been reported for 
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protein structure prediction [g, |7| . In this Letter we ad- 
dress the following two questions: (i) can one do better 
by keeping all the data for reconstruction and not only 
empirical pairwise correlation functions, and (ii) can such 
a method be implemented in a computationally efficient 
manner? The answer is positive on both accounts, using 
a method inspired by the regularized logistic regression 
process of Wainwright, Ravikumar and Lafferty Q- We 
show in particular that keeping all the data greatly im- 
proves reconstruction of an Ising model in the important 
parameter region of strong interactions. 

Maximum log-likelihood, exponential families and com- 
putability: We will from now on assume that we have B 
independent observations {(t'^}£_ 1 all drawn from (p}. 
The log-likelihood function, given these observations, is 
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and 
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are the empirical first and sec- 



ond moments from B samples. A classical result in 
statistics states that for exponential families of param- 
eter distributions, of which the Ising model ([lj is an 
instance, the averages of the functions multiplying the 
model parameters are sufficient statistics [91-Hlj. This 
means that "no other statistic which can be calculated 
from the same sample provides any additional informa- 
tion as to the value of the parameter" [12;], or, in the 
case at hand, that inference of the biases hi and the in- 
teraction strengths Jij cannot be done better using all 
the B samples (NB data points), than by observing just 

m\ and cL ( — 2 data points). The optimal esti- 
mates (in a maximum likelihood sense) are then given by 

d hi logZ = (3m\ B) and d Jtj logZ = (S^^m^ + cjf >]. 
The problem with this solution is that it is unfeasible to 
compute the partition function exactly in large systems. 
A whole series of approximations, reviewed in [13| . have 



therefore been developed by expanding in high temper- 
ature (small interactions), large external fields or other 
parameters cf. (naive) mean-field (nMF) fli ]. TAP fl4J . 
loop summation [151 ] and have been further extended us- 
ing the fluctuation-dissipation theorem [lq - KL8j . It is well- 
established that all these approximate methods are not 
accurate when the number of samples is small, nor when 
the interactions are strong (temperature is low). How- 
ever, a recent method based on expansion of the sys- 
tem into "clusters" (who's contributions to the estimates 
of {h, J} are included or discarded depending on their 
entropy share) manages to select correctly the parame- 
ters from few samples in various low-temperature settings 
.19], questioning these limitations. 

Pseudo-likelihood maximization (without regulariza- 
tion): The conditional probability of one variable o~ r 
given all the others cr\ r = (ai,...,a r -i,o- r +i, ...,ctn) is 



field. At the true parameters these equations hold, since 



P{h,J}{0- r \(T\r) = J 
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If o~ r by itself is considered a dependent variable, and the 
complementary set cr\ r is taken as independent variables, 
then the maximum likelihood estimates of the parameters 
h r and J r = {Ji r }i^ r , given B samples, minimize 
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Minimizing these functions f r for all r simultaneously is 
not the same as maximizing the total log-likelihood (|2|). 
For example, this procedure, which we call pseudo- 
likelihood maximization, would typically give different 
estimates J*- 1 and J*-' 3 depending on if a or o~j is con- 
sidered the dependent variable. We will for definite- 
ness always take the pseudo-likelihood maximization es- 
timate of the coupling constant to be the average J* ? = 

\ ( 'K? + Jfj 3 ) ■ When the number of samples is large 
we can substitute sample average with ensemble average, 
and write 

fr(KX) ™ ( -H P W}(.*r\<r\r)) ) 

= £ln (l + r^K+IVW) P{hJ}{(7) , 

<T 

(5) 

with equality expected in the limit. Necessary maximum 
likelihood conditions (for one of the conditional proba- 
bilities) are then 
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where the expressions vanish because each state for which 
o~ r = 1 has exactly one opposing state for which a r = — 1, 
contributing equally in size. Assuming this stationary 
point is a minimum we can locate, the pseudo-likelihood 
approach to inferring an Ising model is exact in the limit 
of large sample size, and is in this sense qualitatively 
different from other approximate inverse Ising schemes. 

Pseudo -likelihood maximization with l\-regularization: 
Ravikumar, Wainwright and Lafferty in [8j introduced 
a ^-regularized version of the pseudo-likelihood ap- 
proach, i.e. where the functions to be minimized are 
\f r (h' r , J' r ) + A||J^.||i] with some non-zero penalty pa- 
rameter A. l\ (absolute valu e)-regu larization is widely 
used to recover sparse signals |20l - [22| , in situations where 
a large fraction of parameters is known to be zero, but not 
which ones are. The numerical minimization can be done 
efficiently using convex programming, such as the inte- 
rior point method of Koh, Kim and Boyd [23], which we 
have used below. If the goal is to recover the interactions 
Jij as such, then the Zi-regularization only introduces a 
bias and a reconstruction error. If on the other hand the 
goal is to find which interactions are non-zero, and their 
sign, and if the interaction graph is known to be sparse, 
then Zi-regularization is an important tool. For Ising 
models without external fields, h = 0, [§[ establishes de- 
tailed conditions on A and the scaling parameters (B, N, 
maximum node degree of the underlying graph d, min- 
imum value of non-zero interactions) for complete such 
sign-sparsity retrieval with high probability. 

Results for high-quality data: Assuming we can find the 
discussed minimum of Q, the estimator is consistent and 
the fitting is essentially only limited by imperfect sam- 
pling (noisy data). We examine numerically the algo- 
rithm's performance for large values of B (using A = 0) 
in the setting of the dilute Shcrrington-Kirkpatrick (SK) 
model |24| . Every Jy is thus non-zero with probabil- 
ity p, and if so drawn from a Gaussian distribution with 
zero mean and variance 1/c, c = pN . External fields are 
assumed zero. Reconstruction error is measured by 
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and similarly for the variation with respect to an external 



Without regularization, the minima can be found using, 
for example, a standard Newton method. Figure Q] shows 
simulation results for N — 64 compared to naive mean- 
field (nMF) i.e. J" MF = -4 (c^ 1 ^.. The curves are 
the averages of five different runs (error bars are small 
enough to be omitted). MC sampling (with a basic 
acceptance/rejection updating rule) was performed us- 
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FIG. 1. (Color online). Reconstruction errors of pseudo- 
likelihood maximization (PLM) and nMF versus temperature 
for a) fully (p = 1) and b) sparsely (p — 0.1) connected SK- 
systems of size N = 64. The number of MC samples used are 
10 6 (dotted), 10 7 (dashed) and 10 8 (continuous). 



FIG. 2. (Color online), a) Edge agreement versus sample size 
in a binary SK model of size N — 100 and sparsity p = 0.05 for 
PLMs and PLM 5 ,\. T = 2 for all data points, b) Probability 
of 100% edge agreement versus inverse temperature for PLMs 
and PLMs,\ using B = 4500 on 7x7 nearest-neighbor grids 
(TV = 49) with 30% dilution. 



ing a warmup time of 10 7 • N spin updates and a sam- 
pling frequency of one observation every 10 ■ N updates. 
Evidently, pseudo-likelihood maximization outperforms 
nMF in the low temperature region. As T approaches 
one from above (towards the spin glass phase), the naive 
mean-field method gives poor results, while our logistic 
regression algorithm appears unaffected. Lowering the 
temperature further to T — 0.5, where nMF and indeed 
all approximate methods tested on this example to date 
are unusable, pseudo-likelihood maximization continues 
to function adequately. On the other hand, as the tem- 
perature increases, states become more equiprobable and 
greater sample sizes are required to extract relevant in- 
formation about the parameters, resulting in the joining 
of the curves of the two methods at high T. Performance 
is at that point limited by the finiteness of B rather than 
by method choice. Moreover, one can observe that the 
decrease of A seems to follow ~ -j= . Finally, the switch 
to sparse J clearly worsens the performance of nMF, but 
does not seem to affect the pseudo-likelihood scheme. 
The results for system sizes N = 16 and N = 128 are 
similar (data not shown). 

Results for low-quality data: Rebuilding the sign- 
sparsity pattern of J from few samples using the pseudo- 
likelihood maximization (PLM) idea has been done nu- 



merically for various sparsity types in [8j and |25j . We 
provide here some additional results, specifically regard- 
ing the advantages of using a regularization term. Taking 
A > after all makes the optimization problem consider- 
ably harder computationally. A simpler approach would 
be to minimize (U) with A = and declare all couplings 
for which | Jy | < 8 to be zero (for some tolerance 5) . Intu- 
itively, inclusion of a regularization term should allow for 
better utilization of sample information than the simpler 
tolerance approach. As a test case we look at a version of 
the SK model where the couplings are not Gaussian but 
binary, Jy = ±-7^ (with equal probability). The infer- 
ence quality is measured as the percentage of pairs (i, j) 
where the interaction strength is identified correctly as 
" + ", "0" or " - ". PLM using tolerance only and PLM 
using regularization (as well as a tolerance limit) will be 
referred to as PLMs and PLM$ : \ respectively. Figure 
Hi shows that for N = 100, p = 0.05 and T = 2, PLM s ,x 
fits the edges more accurately and gives perfect recon- 
struction for fewer samples than PLMs- Note that in 
this example guessing J*j — for all pairs would result 
in a 95% edge agreement on average. Optimal values of 
5 and {5, A} for each B were determined empirically and 
used on 20 new parameter sets to yield the averages. 
For several sparsity structures the performance of PLM 



has been shown to drop as the t emp erature goes below 
some T crit even if B is quite large [25[ . One such example 
is B — 4500 on 7x7 nearest-neighbor grids with positive 
couplings, where each edge in the grid is removed with 
probability 0.3 and the remaining couplings are set to 
one. The "failure" occurs close to the known critical 
point for the Ising model on such grids [25|, j3 cr it ~ 0.7 
[26j . We applied PLMs,\ and PLMs to this problem to 
see whether combined regularization-tolerance can boost 
performance at low temperatures. Figure [5b shows the 
outcome, where optimal 6 and {<5, A} for each /3 were 
again found empirically and probabilities estimated us- 
ing 200 new grids. A breakdown is indeed seen for PLM$ 
around j3 = 0.7, but the effect on PLM$^\ is less pro- 
nounced if there at all. Perfect edge recovery, using 
the latter, is had with high probability far into the low- 
temperature region. The complete data output (not re- 
ported) shows that including the tolerance threshold in 
PLMs,\ (as opposed to trusting the regularization term 
alone to force suitable estimates of Jij to zero), becomes 
necessary at low temperatures. MC samples in this case 
were generated using a warmup time of 10 7 • N spin up- 
dates and a sampling frequency of one observation every 
2000 • N updates. 

Discussion: Our results suggest that the pseudo- 
likelihood approach allows for accurate inference in Ising 
models even for large strongly coupled systems. The 
method relies on utilization of complete data sets, im- 
plying that the Ising model is considered a model to fit 
to data as such, and not as a maxentropy model based 
on means and correlations. 

Our results also confirm that including an l\- 
rcgularization term is helpful in retrieving sign-sparsity 
from few samples, allowing for complete graph recon- 
struction even in low temperature regions. A tolerance 
threshold in combination with ^i-regularization seems to 
be necessary to get the best results. It is reasonable 



that heavy regularization may run into problems easier 
than would milder regularization followed by a tolerance 
limit. We also add that the code employed at A > 
includes estimates of the external fields, thus not utiliz- 
ing the knowledge that (in our example) h = 0. It is 
quite possible that even better results can be obtained if 
the algorithm optimized without fields in the likelihood 
functions, especially when fitting from few samples. 

The time required to solve the N logistic regression 
problems is not insignificant. For the N = 64 cases with 
10 8 samples it takes hours on a standard home PC us- 
ing Newton decent. However, we also applied a quasi- 
Newton version where the Hessian was approximated us- 
ing only every 100 t/l or even 1000" 1 sample since the main 
hurdle is evaluating the Hessian of the objective func- 
tion, which depends on all 10 8 samples. This version of 
the procedure located all the minima successfully in a 
fraction of the time. Also, several algorithms applicable 
for logistic regression who are typically much faster than 
Newton decent are available, such as other quasi-Newton 
and Conjugate Gradient methods. In the few-sample sec- 
tion, computations naturally run much faster and time is 
not as big an issue. Therefore, in a region where pseudo- 
likelihood maximization is particularly interesting (small 
sample size), it is also computationally efficient and com- 
petitive. 
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